Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

نویسندگان

  • Christian Reul
  • Uwe Springmann
  • Christoph Wick
  • Frank Puppe
چکیده

We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1% on average.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving OCR Accuracy on Early Printed Books by utilizing Cross Fold Training and Voting

In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence based voting. After allocating the available ground truth in different subsets several training processes are performed, each resulting in a specific OCR model. The O...

متن کامل

Improving OCR Accuracy on Early Printed Books using Deep Convolutional Networks

This paper proposes a combination of a convolutional and a LSTM network to improve the accuracy of OCR on early printed books. While the standard model of line based OCR uses a single LSTM layer, we utilize a CNNand Pooling-Layer combination in advance of an LSTM layer. Due to the higher amount of trainable parameters the performance of the network relies on a high amount of training examples t...

متن کامل

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

This article describes the results of a case study that applies Neural Networkbased Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus (Breuel et al. 2013) on the RIDGES herbal text corpus (Odebrecht et al. 2017, in press). Training specific OCR models was possible because the necessary ground truth is available as err...

متن کامل

A Modfied Self-organizing Map Neural Network to Recognize Multi-font Printed Persian Numerals (RESEARCH NOTE)

This paper proposes a new method to distinguish the printed digits, regardless of font and size, using neural networks.Unlike our proposed method, existing neural network based techniques are only able to recognize the trained fonts. These methods need a large database containing digits in various fonts. New fonts are often introduced to the public, which may not be truly recognized by the Opti...

متن کامل

Transfer Learning for OCRopus Model Training on Early Printed Books

A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building from already existing models during training instead of starting from scratch. To overcome the discrepancies between the set of characters of the pretra...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1802.10038  شماره 

صفحات  -

تاریخ انتشار 2018